-
Notifications
You must be signed in to change notification settings - Fork 181
Moved management of shader module creation and lifetime to its own class. #444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…e implementation Signed-off-by: softcookiepp <[email protected]>
Signed-off-by: softcookiepp <[email protected]>
Signed-off-by: softcookiepp <[email protected]>
axsaucedo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done an initial review, seems like still a way to go, there's a few inconsistencies that need to be addressed, as well as caveats on existing approach and coding style. There also doesn't seem to be a fully complete implementation as this suggests an optionally owned resource but currently the PR does not introduce the functionaly to make it optionally owned. We also need to ensure new functionality is tested, especially in contexts where these resources are owned, and particularly assessing the memory management / lifecycle aspect.
src/include/kompute/Cache.hpp
Outdated
| @@ -0,0 +1,27 @@ | |||
| #pragma once | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is quite random, what is this here for - remove
src/include/kompute/Shader.hpp
Outdated
| /* | ||
| * getter for mShaderModule | ||
| */ | ||
| vk::ShaderModule& getShaderModule() { return mShaderModule; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be const and should not return a mutable reference
| public: | ||
|
|
||
| /* | ||
| * Constructor accepting a device and a SPIR-V binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment conventions should align to rest of codebase - see manager.hpp
src/include/kompute/Shader.hpp
Outdated
| // the vulkan device; not owned by this object | ||
| std::weak_ptr<vk::Device> mDevice; | ||
|
|
||
| // the shader module handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superfluous comment, see other classes for examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- needs clarity on ownerhsip as this is sometimes owned resource
src/include/kompute/Shader.hpp
Outdated
| * Wrapper for Vulkan's shader modules. | ||
| * The purpose of this is to manage the module lifetime, while | ||
| * building the groundwork for easily integrating things like | ||
| * SPIR-V reflection and multiple entry points in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite as description of class as of now not for future / groundwork
src/include/kompute/Shader.hpp
Outdated
| * building the groundwork for easily integrating things like | ||
| * SPIR-V reflection and multiple entry points in the future. | ||
| */ | ||
| class Module : public std::enable_shared_from_this<Module> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming convention here is not consistent, should be Shader as per file - module is a generic keyword
src/include/kompute/Algorithm.hpp
Outdated
| bool mFreeDescriptorSet = false; | ||
| std::shared_ptr<vk::ShaderModule> mShaderModule; | ||
| bool mFreeShaderModule = false; | ||
| std::shared_ptr<Module> mModule = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is suggested as optionally owned resource but it's not handled as such, follow the same convention as per the rest of the optionally owned resources
src/Shader.cpp
Outdated
| spv.size()); | ||
| vk::ShaderModuleCreateInfo shaderModuleInfo(vk::ShaderModuleCreateFlags(), | ||
| sizeof(uint32_t) * spv.size(), spv.data()); | ||
| this->mDevice.lock()->createShaderModule( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we don't use device.lock throughout the codebase, let's ensure consitency
src/include/kompute/Shader.hpp
Outdated
| class Module : public std::enable_shared_from_this<Module> | ||
| { | ||
| // the vulkan device; not owned by this object | ||
| std::weak_ptr<vk::Device> mDevice; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Device is not a weakptr as we need to ensure that this resource doesnt outlive the lifecylce by storing as strong reference and/or ensuring this is handled correctly
src/Shader.cpp
Outdated
| { | ||
| KP_LOG_DEBUG("Kompute Module destructor started"); | ||
| KP_LOG_DEBUG("Kompute Module Destroying shader module"); | ||
| if (!mDevice.expired() ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure consistency with other classes, including coding style
|
Alright, that should cover everything. |
axsaucedo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @softcookiepp - all tests are failing, let's make sure tests pass locally before re-submitting
|
Right now I cannot even get the tests to run on my system due to this issue: #445 |
|
Oh right I see, you could also try running it with |
|
Thanks! Will do. |
|
Alright, I got all the tests passing. I also changed the vk::Device queue family selection criteria from just The queue family selection in general could use some reworking to prevent invalid API usage (see here: #445 (comment)) but I will leave that for another day. |
axsaucedo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further changes requested - some strange changes that don't have anything to do with the PR, let's avoid this. Let's also make sure that this code is tested by evaluating an Algo rebuild as this would trigger the full cycle.
| for (uint32_t i = 0; i < numParallel; i++) { | ||
| inputsAsyncB.push_back(mgr.tensor(data)); | ||
| algosAsync.push_back(mgr.algorithm({ inputsAsyncB[i] }, spirv)); | ||
| inputsAsyncB.push_back(mgrAsync.tensor(data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was wrong and was causing my driver to crash. It was trying to use different kp::Manager (and therefore different underlying vk::Device) instances for allocating buffers, creating pipelines and executing them. You can't do this, even if the underlying hardware is the same. If you read the Vulkan validation errors while running the tests, you would see this:
[Jan 8 2026 11:08:47] [debug] [Manager.cpp:42] [VALIDATION]: Validation - vkCmdPipelineBarrier(): pBufferMemoryBarriers[0].buffer (VkBuffer 0x280000000028) was created, allocated or retrieved from VkDevice 0x5a2129296950, but command is using (or its dispatchable parameter is associated with) VkDevice 0x5a2129386e80
My change here fixes this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still isn't completely fixed; see here for details: #445 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not "fixing" anything. This is a test that is currently set up on specific hardware, which is testing parallel execution. What you are doing is just not running the test. This is not "fixing" something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still crashes, but the validation error in my message no longer appears when I attempt to run the test. So it is a step in the right direction towards compliant API usage, but not a complete fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to see what this validation error may be, but your change basically makes this test redundant, so it's not correct. If you read through the test you are making it to not compare anything; the test is profiling a parallel queue with a non-parallel queue on a specific GPU that actually supports parallel processing (not async, parallel). Here's more info: https://medium.com/data-science/parallelizing-heavy-gpu-workloads-via-multi-queue-operations-50a38b15a1dc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplified reasoning is that you are taking an underlying vk::PhysicalDevice and creating two different vk::Device instances with it, since each kp::Manager creates its own vk::Device. Though these may have the same physical GPU, the Vulkan driver may initialize separate resources every time a vk::Device is created for a given GPU.
What do these resources look like? It really depends on the driver. NVIDIA drivers (which given your comments, it seems you have) are more robust in this regard, while AMD drivers (which I am using right now) are quite a bit more unforgiving about things like this.
So when you try to use memory and pipelines allocated with mgr (which has its own vk::Device instance) with mgrAsync (which has a totally separate vk::Device), you are violating one of the core assumptions that Vulkan drivers make. You may as well be asking for it to mix and match resources created on different physical GPUs.
As for why the test is redundant now, I am really not sure; both mgr and mgrAsync are still used in this test. I just made sure mgrAsync did not accept tensors and algorithms created with mgr. If you could point to exactly which lines break the test and why, that would be very much appreciated.
axsaucedo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few more comments but seems almost there. Are there any tests that we can add as well? Seems to me this segfault piece would benefit from one. We also need to make sure that the documentation is updated. Once these are resolved we should be able to merge. Thank you.
|
Thanks @softcookiepp - almost there, I just added a follow up reply to claify the tests |
As the title says, the creation of vk::ShaderModule objects has been moved from kp::Algorithm kp::Module, a new class.
This will make it easier to implement caching of vk::ShaderModule objects, as well as add aditional features for shader module metadata handling in the future.